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1  Introduction 

The  goal  of  the  TREC  Web  track  is  to  explore  and  evaluate  retrieval  ap¬ 
proaches  over  large-scale  subsets  of  the  Web  -  currently  on  the  order  of 
one  billion  pages.  For  TREC  2013,  the  fifth  year  of  the  Web  track,  we 
implemented  the  following  significant  updates  compared  to  2012. 

First,  the  Diversity  task  was  replaced  with  a  new  Risk-sensitive  retrieval 
task  that  explores  the  tradeoffs  systems  can  achieve  between  effectiveness 
(overall  gains  across  queries)  and  robustness  (minimizing  the  probability 
of  significant  failure,  relative  to  a  provided  baseline).  Second,  we  based  the 
2013  Web  track  experiments  on  the  new  ClueWebl2  collection  created  by  the 
Language  Technologies  Institute  at  Carnegie  Mellon  University.  ClueWebl2 
is  a  successor  to  the  ClueWebOO  dataset,  comprising  about  one  billion  Web 
pages  crawled  between  Feb-May  2012.^  The  crawling  and  collection  process 
for  ClueWebl2  included  a  rich  set  of  seed  URLs  based  on  commercial  search 
traffic,  Twitter  and  other  sources,  and  multiple  measures  for  flagging  unde¬ 
sirable  content  such  as  spam,  pornography,  and  malware.  The  Adhoc  task 
continued  as  in  previous  years. 

Both  the  Adhoc  and  Risk-sensitive  tasks  used  a  common  topic  set  of 
50  new  topics,  and  differed  only  in  their  evaluation  methodology.  With 
the  goal  of  reflecting  aspects  of  authentic  Web  usage,  the  Web  track  topics 
were  again  developed  from  the  logs  and  data  resources  of  commercial  search 
engines.  However,  a  different  extraction  methodology  was  used  compared 
to  last  year.  This  year,  two  types  of  topics  were  developed:  faceted  topics, 

^Details  on  ClueWebl2  are  available  at  http://boston.lti.cs.cmu.edu/cluewebl2 
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and  unfaceted  (single-facet)  topics.  Faceted  topics  were  more  like  “head” 
queries,  and  structured  as  having  a  representative  set  of  subtopics,  with  each 
subtopic  corresponding  to  a  popular  subintent  of  the  main  topic.  The  faceted 
topic  queries  were  less  directly  ambiguous  than  last  year;  the  ambiguity  lay 
more  in  which  subintents  were  likely  to  be  most  relevant  to  users  and  not  in 
the  direct  interpretation  of  the  query.  Unfaceted  (single- facet)  topics  were 
intended  to  be  more  like  “tail”  queries  with  a  clear  question  or  intent.  For 
faceted  topics,  query  clusters  were  developed  and  used  by  NIST  for  topic 
development.  Only  the  base  query  was  released  to  participants  initially: 
the  topic  structures  containing  subtopics  and  single-  vs  multi-faceted  vs. 
topic  type  were  only  released  after  runs  were  submitted.  This  was  done  to 
avoid  biases  that  might  be  caused  by  revealing  extra  information  about  the 
information  need  that  may  not  be  available  to  Web  search  systems  as  part 
of  the  actual  retrieval  process. 

The  Adhoc  task  judged  documents  with  respect  to  the  topic  as  a  whole. 
Relevance  levels  are  similar  to  the  levels  used  in  commercial  Web  search, 
including  a  spam/junk  level.  The  top  two  levels  of  the  assessment  structure 
are  related  to  the  older  Web  track  tasks  of  homepage  finding  and  topic 
distillation.  Subtopic  assessment  was  also  performed  for  the  faceted  topics, 
as  described  further  in  Section  3. 

Table  1  summarizes  participation  in  the  TREC  2013  Web  Track.  Overall, 
we  received  61  runs  from  15  groups;  34  ad  hoc  runs  and  27  risk-sensitive 
runs.  The  number  of  participants  in  the  Web  track  increased  slightly  over 
2012  (when  12  groups  participated,  submitting  48  runs),  including  some 
institutions  that  had  not  previously  participated.  Eight  runs  were  manual 
runs,  submitted  across  four  groups:  all  other  runs  were  automatic  with 
no  human  intervention.  Nine  of  the  runs  used  the  Category  B  subset  of 
ClueWebl2:  all  other  runs  used  the  main  Category  A  corpus. 
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2  ClueWebl2  Category  A  and  B  collections 


As  with  ClueWeb09,  the  ClueWebl2  collection  comes  with  two  datasets: 
Category  A,  and  Category  B.  The  Category  A  dataset  is  the  main  corpus 
and  contains  about  733  million  documents  (27.3  TB  uncompressed,  5.54 
TB  compressed).  The  Category  B  dataset  is  a  sample  from  Category  A, 
containing  about  52  million  documents,  or  about  7%  of  the  Category  A 
total.  Details  on  how  the  Category  A  and  B  collections  were  created  may  be 
found  on  the  Lemur  project  website^.  We  strongly  encouraged  participants 
to  use  the  full  Category  A  data  set  if  possible.  Results  in  the  paper  are 
labeled  by  their  collection  category. 

3  Topics 

NIST  created  and  assessed  50  new  topics  for  the  Web  track.  Unlike  TREC 
2012,  the  TREC  2013  Web  track  included  a  significant  proportion  of  more 
focused  topics  designed  to  represent  more  specific,  less  frequent,  possibly 
more  difficult  queries.  To  retain  the  Web  flavor  of  queries  in  this  track,  we 
retain  the  notion  from  last  year  that  some  topics  may  be  multi-faceted,  i.e. 
broader  in  intent  and  thus  structured  as  a  representative  set  of  subtopics, 
each  related  to  a  different  potential  aspect  of  user  need.  Examples  are 
provided  below.  For  topics  with  multiple  subtopics,  documents  were  judged 
with  respect  to  the  subtopics.  For  each  subtopic,  NIST  assessors  made 
a  scaled  six-point  judgment  as  to  whether  or  not  the  document  satisfied 
the  information  need  associated  with  the  subtopic.  For  those  topics  with 
multiple  subtopics,  the  set  of  subtopics  was  intended  to  be  representative, 
not  exhaustive. 

Subtopics  were  based  on  information  extracted  from  the  logs  of  a  com¬ 
mercial  search  engine.  Topics  having  multiple  subtopics  had  subtopics  se¬ 
lected  roughly  by  overall  popularity,  which  was  achieved  using  combined 
query  suggestion  and  completion  data  from  two  commercial  search  engines. 

In  this  way,  the  focus  was  retained  on  a  balanced  set  of  popular  subtopics, 
while  limiting  the  occurrence  of  strange  and  unusual  interpretations  of  subtopic 
aspects.  Single-facet  topic  candidates  were  developed  based  on  queries  ex¬ 
tracted  from  search  log  data  that  were  low-frequency  (‘tail-like’)  but  issued 
by  multiple  users;  less  than  10  terms  in  length;  and  relatively  low  effective¬ 
ness  scores  across  multiple  commercial  search  engines  (as  of  January  2013). 

The  topic  structure  was  similar  to  that  used  for  the  TREC  2009  topics. 

■^http :  / /lemurpro  j  ect .  org/ cluewebl2/ specs .  php 
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Examples  of  single-facet  topics  include: 

<topic  n\iinber="227"  type="single"> 

<query>i  will  survive  lyrics</query> 

<description>Find  the  lyrics  to  the  song  "I  Will  Survive" . </description> 

</topic> 

<topic  number="229"  type="single"> 

<query>beef  strogEuioff  recipe</query> 

<description> 

Find  complete  (not  partial)  recipes  for  beef  stroganoff . 

</description> 

</topic> 

Examples  of  faceted  topics  include: 

<topic  number="235"  type="f aceted"> 

<query>ham  radio</query> 

<description>How  do  you  get  a  ham  radio  license?</description> 

<subtopic  number="l"  type="inf ">How  do  you  get  a  ham  radio  license?</subtopic> 

<subtopic  number="2"  type="nav">What  are  the  ham  radio  license  classes?</subtopic> 

<subtopic  number="3"  type="inf ">How  do  you  build  a  ham  radio  station?</subtopic> 

<subtopic  number="4"  type="inf ">Find  information  on  ham  radio  antennas . </subtopic> 

<subtopic  number="5"  type="nav">What  are  the  ham  radio  call  signs?</subtopic> 

<subtopic  number="6"  type="nav">Find  the  web  site  of  Ham  Radio  Outlet . </subtopic> 

</topic> 

<topic  number="245"  type="f aceted"> 

<query>roosevelt  island</query> 

<description>What  restaurants  are  on  Roosevelt  Island  (NY) ?</description> 

<subtopic  number="l"  type="inf ">What  restaurants  are  on  Roosevelt  Island  (NY) ?</subtopic> 
<subtopic  number="2"  type="nav">Find  the  Roosevelt  Island  tram  schedule . </subtopic> 

<subtopic  number="3"  type="inf ">What  is  the  history  of  the  Roosevelt  Island  tram?</subtopic> 
<subtopic  number="4"  type="nav">Find  a  map  of  Roosevelt  Island  (NY) . </subtopic> 

<subtopic  number="5"  type="inf"> 

Find  real  estate  listings  for  Roosevelt  Island  (NY) . 

</subtopic> 

Initial  topic  release  to  participants  included  only  the  query  field,  as 
shown  in  the  excerpt  here: 

201 : raspberry  pi 

202:uss  carl  vinson 

203: reviews  of  les  miserables 
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204: rules  of  golf 

205: average  charitable  donation 

As  shown  in  the  above  examples,  those  topics  with  a  clear  focused  intent 
have  a  single  subtopic.  Topics  with  multiple  subtopics  reflect  underspec¬ 
ified  queries,  with  different  aspects  covered  by  the  subtopics.  We  assume 
that  a  user  interested  in  one  aspect  may  still  be  interested  in  others.  Each 
subtopic  was  categorized  as  being  either  navigational  (“nav”)  or  informa¬ 
tional  (“inf”).  A  navigational  subtopic  usually  has  only  a  small  number 
of  relevant  pages  (often  one).  For  these  subtopics,  we  assume  the  user  is 
seeking  a  page  with  a  specific  URL,  such  as  an  organization’s  homepage. 
On  the  other  hand,  an  informational  query  may  have  a  large  number  of  rel¬ 
evant  pages.  For  these  subtopics,  we  assume  the  user  is  seeking  information 
without  regard  to  its  source,  provided  that  the  source  is  reliable. 

For  the  adhoc  task,  relevance  is  judged  on  the  basis  of  the  description 
field.  Thus,  the  first  subtopic  is  always  identical  to  the  description  sentence. 

4  Methodology  and  Measures 

4.1  Pooling  and  Jndging 

For  each  topic,  participants  in  the  adhoc  and  risk-sensitive  tasks  submitted 
a  ranking  of  the  top  10,000  results  for  that  topic.  All  submitted  runs  were 
included  in  the  pool  for  judging.  A  common  pool  was  used. 

For  the  risk-sensitive  task,  new  versions  of  ndeval  and  gdeval  that 
supported  the  risk-sensitive  versions  of  the  evaluation  measures  (described 
below)  were  provided  to  NIST. 

All  data  and  tools  required  for  evaluation,  including  the  scoring  programs 
ndeval  and  gdeval  as  well  as  the  baseline  run  used  in  computation  of  the  risk- 
sensitive  scores  (run  results-cata-filtered.txt)  were  available  in  the  track’s 
github  distribution^. 

The  relevance  judgment  for  a  page  was  one  of  a  range  of  values  as  de¬ 
scribed  in  Section  4.2.  The  topic-aspect  combinations  with  zero  known 
relevant  documents  were  eliminated  from  all  the  evaluations.  These  are: 
topic  202,  aspect  2 
topic  202,  aspect  3 
topic  216,  aspect  2 
topic  225,  aspect  1 
topic  225,  aspect  5 

^http : //github . com/trec-web/trec-web-2013 
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topic  244,  aspect  2 
topic  244,  aspect  3 

For  topics  that  had  a  single  aspect  in  the  original  topics  file,  that  one 
aspect  is  used.  For  all  other  topics  except  topic  225,  aspect  number  1  is  the 
single  aspect.  For  topic  225,  aspect  number  2  is  the  single  aspect  (aspect  2  is 
used  instead  of  aspect  1  because  aspect  1  has  no  known  relevant  documents) . 

Different  topics  were  pooled  to  different  depths  because  the  original 
depth  (20)  resulted  in  too  many  documents  to  be  judged  in  the  allotted 
amount  of  assessing  time.  Smaller  pools  were  built  using  a  depth  of  10.  In 
all  cases,  the  pools  were  built  over  all  submitted  runs.  Pools  were  sorted 
so  that  pages  from  the  same  site  (as  determined  by  URL  syntax)  were  con¬ 
tiguous  in  the  pool.  For  multi-aspect  topics,  assessors  judged  a  given  page 
against  all  aspects  before  moving  to  the  next  page  in  the  pool.  Topics  judged 

to  depth  20  were:  201,  202,  203,  204,  205,  206,  208,  210,  211,  212,  214,  215, 

216,  217,  218,  219,  220,  221,  223,  224,  225,  226,  232,  233,  234,  239,  240,  243, 

247.  Topics  judged  to  depth  10  were:  207,  209,  213,  222,  227,  228,  229,  230, 

231,  235,  236,  237,  238,  241,  242,  244,  245,  246,  248,  249,  250. 

4.2  Ad-hoc  Retrieval  Task 

An  ad-hoc  task  in  TREC  provides  the  basis  for  evaluating  systems  that 
search  a  static  set  of  documents  using  previously-unseen  topics.  The  goal 
of  an  ad-hoc  task  is  to  return  a  ranking  of  the  documents  in  the  collection 
in  order  of  decreasing  probability  of  relevance.  The  probability  of  relevance 
for  a  document  is  considered  independently  of  other  documents  that  appear 
before  it  in  the  result  list.  For  the  ad-hoc  task,  documents  are  judged  on 
the  basis  of  the  description  field  using  a  six-point  scale,  defined  as  follows: 

1.  Nav:  This  page  represents  a  home  page  of  an  entity  directly  named 
by  the  query;  the  user  may  be  searching  for  this  specific  page  or  site, 
(relevance  grade  4) 

2.  Key:  This  page  or  site  is  dedicated  to  the  topic;  authoritative  and 
comprehensive,  it  is  worthy  of  being  a  top  result  in  a  web  search  engine, 
(relevance  grade  3) 

3.  HRel:  The  content  of  this  page  provides  substantial  information  on 
the  topic,  (relevance  grade  3) 

4.  Rel:  The  content  of  this  page  provides  some  information  on  the  topic, 
which  may  be  minimal;  the  relevant  information  must  be  on  that  page. 
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not  just  promising- looking  anchor  text  pointing  to  a  possibly  useful 
page,  (relevance  grade  1) 

5.  Non:  The  content  of  this  page  does  not  provide  useful  information  on 
the  topic,  but  may  provide  useful  information  on  other  topics,  includ¬ 
ing  other  interpretations  of  the  same  query,  (relevance  grade  0) 

6.  Junk:  This  page  does  not  appear  to  be  useful  for  any  reasonable  pur¬ 
pose;  it  may  be  spam  or  junk  (relevance  grade  -2). 

After  each  description  we  list  the  relevance  grade  assigned  to  that  level 
as  they  appear  in  the  judgment  (qrels)  file.  These  relevance  grades  are  also 
used  for  calculating  graded  effectiveness  measures,  except  that  a  value  of  -2 
is  treated  as  0  for  this  purpose.  For  binary  effectiveness  measures,  we  treat 
grades  1/2/3/4  as  relevant  and  grades  0/-2  as  non-relevant. 

The  primary  effectiveness  measure  for  the  ad-hoc  task  is  expected  recipro- 
eal  rank  (ERR)  as  defined  by  Chapelle  et  al.  [1] .  We  also  report  a  variant  of 
NDCG  [3]  as  well  as  standard  binary  effectiveness  measures,  including  mean 
average  precision  (MAP)  and  precision  at  rank  k  (P@k).  To  account  for  the 
faceted  topics,  we  also  report  diversity-based  versions  of  these  measures: 
intent-aware  expected  reciprocal  rank  (ERR-IA)  [1]  and  a-nDCDG  [2]. 

Figure  1  summarizes  the  per-topic  variability  in  ERR@10  across  all  sub¬ 
mitted  runs.  Figure  2  shows  the  variability  in  ERR@10  for  two  specific 
top-ranked  runs,  from  the  Technion  and  University  of  Glasgow,  with  base¬ 
line  included  for  comparison. 

4.3  Risk-sensitive  Retrieval  Task 

The  new  risk-sensitive  retrieval  task  for  Web  evaluation  rewards  algorithms 
that  not  only  achieve  improvements  in  average  effectiveness  across  topics  (as 
in  the  ad-hoc  task),  but  also  maintain  good  robustness,  which  we  define  as 
minimizing  the  risk  of  significant  failure  relative  to  a  given  baseline. 

Search  engines  use  increasingly  sophisticated  stages  of  retrieval  in  their 
quest  to  improve  result  quality:  from  personalized  and  contextual  re-ranking 
to  automatic  query  reformulation.  These  algorithms  aim  to  increase  retrieval 
effectiveness  on  average  across  queries,  compared  to  a  baseline  ranking  that 
does  not  use  such  operations.  However,  these  operations  are  also  risky  since 
they  carry  the  possibility  of  failure  -  that  is,  making  the  results  worse  than  if 
they  had  not  been  used  at  all.  The  goal  of  the  risk-sensitive  task  is  two-fold: 
1)  To  encourage  research  on  algorithms  that  go  beyond  just  optimizing  av¬ 
erage  effectiveness  in  order  to  effectively  optimize  both  effectiveness  and  ro- 
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bustness,  and  achieve  effective  tradeoffs  between  these  two  competing  goals; 
and  2)  to  explore  effective  risk-aware  evaluation  criteria  for  such  systems. 

The  risk-sensitive  retrieval  track  is  related  to  the  goals  of  the  earlier 
TREC  Robust  Track  (TREC  2004,  2005),^  which  focused  on  increasing  re¬ 
trieval  effectiveness  for  poorly-performing  topics  using  evaluation  measures 
such  as  geometric  MAP  that  focused  on  maximizing  the  average  improve¬ 
ment  on  the  most  difficult  topics.  The  risk-sensitive  retrieval  track  can  be 
thought  of  as  a  next  step  in  exploring  more  general  retrieval  objectives  and 
evaluation  measures  that  (a)  explicitly  account  for,  and  can  differentiate 
systems  based  on,  differences  in  variance  or  other  risk-related  statistics  of 
the  win/loss  distribution  across  topics  for  a  single  run,  (b)  the  quality  of  the 
curve  derived  from  a  set  of  tradeoffs  between  effectiveness  and  robustness 
achievable  by  systems,  measured  across  multiple  runs  at  different  average 
effectiveness  levels,  and  (c)  computing  (a)  and  (b)  by  accounting  for  the  effec¬ 
tiveness  of  a  competing  baseline  (both  standard,  and  participant-supplied) 
as  a  factor  in  optimizing  retrieval  performance. 

As  a  standard  baseline,  we  used  a  pseudo-relevance  feedback  run  as  im¬ 
plemented  by  the  Indri  retrieval  engine.^  Specifically,  for  each  query,  we 
used  10  feedback  documents,  20  feedback  terms,  and  a  linear  interpolation 
weight  of  0.60  with  the  original  query.  Additionally,  we  used  the  Waterloo 
spam  classiher  to  filter  out  all  documents  with  a  percentile-score  less  than 
70.6. 

As  with  the  adhoc  task,  we  use  Intent- Aware  Expected  Reciprocal  Rank 
(ERR-IA)  as  the  basic  measure  of  retrieval  effectiveness,  and  per-query  re¬ 
trieval  delta  is  dehned  as  the  absolute  difference  in  effectiveness  between  a 
contributed  run  and  the  above  standard  baseline  run,  for  a  given  query.  A 
positive  delta  means  a  win  for  the  system  on  that  query,  and  negative  delta 
means  a  loss.  We  also  report  other  flavors  of  the  risk-related  measure  based 
on  NDCG.  Eor  single  runs,  the  following  will  be  the  main  risk-sensitive  eval¬ 
uation  measure.  Let  A(g)  =  Ra{q)  —  Rbase{q)  be  the  absolute  win  or  loss 
for  query  q  with  system  retrieval  effectiveness  Ra{(1)  relative  to  the  base¬ 
line’s  effectiveness  Rbase{q)  for  the  same  query.  We  categorize  the  outcome 
for  each  query  q  in  the  set  Q  of  all  N  queries  according  to  the  sign  of  A(g), 
giving  three  categories: 

Hurt  Queries  {Q-)  have  A(g)  <  0;  Unchanged  Queries  (Qo)  have  A(g)  =  0; 
Improved  Queries  (<5+)  have  A(g)  >  0. 

“^http :  //tree  .nist .  gov/data/robust/04  .guidelines  .html 
®http : //www. lemurproject . org/indri/ 

®http : //www. manse i .uwaterloo . ea/~msmueker/ewl2spam/ 


The  risk-sensitive  utility  measure  Urisk{Q)  of  a  system  over  the  set  of 
queries  Q  is  defined  as: 

Urisk{Q)  =  l/N  •  [S,6Q^A(g)  -  (a  +  1)S,6q.  A(g)]  (1) 

where  a  is  the  key  risk-aversion  parameter.  In  words,  this  rewards  systems 
that  maximize  average  effectiveness,  but  also  penalizes  losses  relative  to  the 
baseline  results  for  the  same  query,  weighting  losses  a  +  1  times  as  heavily 
as  successes.  When  the  risk  aversion  parameter  a  is  large,  a  system  will 
become  more  conservative  and  put  more  emphasis  on  avoiding  large  losses 
relative  to  the  baseline.  When  a  is  small,  a  system  will  tend  to  ignore  the 
baseline.  The  adhoc  task  objective,  maximizing  only  average  effectiveness 
across  queries,  corresponds  to  the  special  case  a  =  0.  Details  are  given  in 
Appendix  A  of  the  TREC  Web  2013  Guidelines^. 

Table  2:  Top  ad-hoc  task  results  ordered  by  ERR@10.  Only  the  best  run  according 
to  ERR@10  from  each  group  is  included  in  the  ranking. 


Group 

Run 

Cat 

Type 

ERROlO 

nDCGOlO 

Technion 

clustmrfaf 

A 

auto 

0.175 

0.298 

udeLfang 

UDInfolabWEB2 

A 

auto 

0.167 

0.284 

uogTr 

uogTrAIwLmb 

A 

auto 

0.151 

0.247 

udel 

udelManExp 

A 

manual 

0.150 

0.241 

ICTNET 

ICTNET13RSR2 

A 

auto 

0.149 

0.224 

ut 

ut22xact 

A 

auto 

0.144 

0.230 

diro_web_13 

udemQlmllEbWiki 

A 

auto 

0.143 

0.255 

wistud 

wistud. runD 

A 

manual 

0.125 

0.215 

CWI 

cwiwtl3cps 

A 

auto 

0.121 

0.211 

UJS 

UJS13LCRAd2 

B 

auto 

0.100 

0.155 

webis 

webisrandom 

A 

auto 

0.093 

0.171 

RMIT 

RMITSC75 

A 

auto 

0.093 

0.172 

Organizers 

baseline 

A 

auto 

0.088 

0.162 

MSR_Redmond 

msr_alpha0_95_4 

A 

manual 

0.087 

0.157 

UWaterlooCLAC 

UWCWEB13RISK02 

A 

auto 

0.080 

0.134 

DLDE 

dlde 

B 

manual 

0.008 

0.009 

^http:  / /research. microsoft  .com  /  en-us  /  projects  /  trec-web-2013/ 
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Table  3:  Top  ad-hoc  task  results  ordered  by  ERR@20.  Only  the  best  run  according 
to  ERR(§)20  from  each  group  is  included  in  the  ranking. 


Group 

Run 

Cat 

Type 

ERR@20 

nDCG@20 

Technion 

clustmrfaf 

A 

auto 

0.184 

0.310 

udeLfang 

UDInfolabWEB2 

A 

auto 

0.176 

0.282 

uogTr 

uogTrAIwLmb 

A 

auto 

0.160 

0.259 

ICTNET 

ICTNET13RSR2 

A 

auto 

0.158 

0.236 

udel 

udelManExp 

A 

manual 

0.157 

0.246 

ut 

ut22xact 

A 

auto 

0.152 

0.228 

diro_web_13 

udemQlmllEbWiki 

A 

auto 

0.152 

0.254 

wistud 

wistud. runD 

A 

manual 

0.134 

0.225 

CWI 

cwiwtl3cps 

A 

auto 

0.128 

0.218 

UJS 

UJS13LCRAd2 

B 

auto 

0.107 

0.148 

RMIT 

RMITSCTh 

A 

auto 

0.102 

0.179 

webis 

webisrandom 

A 

auto 

0.101 

0.181 

MSR_Redmond 

msr_alpha0_95_4 

A 

manual 

0.097 

0.175 

Organizers 

baseline 

A 

auto 

0.096 

0.168 

UWaterlooCLAC 

UWCWEB13RISK02 

A 

auto 

0.085 

0.132 

DLDE 

dlde 

B 

manual 

0.008 

0.007 

5  Conclusions  and  Future  Plans 

The  Web  track  will  continue  for  a  sixth  year  in  TREC  2014,  using  sub¬ 
stantially  the  same  tasks  and  methodology  as  this  year,  but  with  potential 
adjustments  in  some  aspects.  The  following  are  known  areas  for  refinement, 
based  on  participant  feedback  and  our  experience  organizing  this  year’s  Web 
track. 


•  Improved  methodology  for  developing  and  assessing  the  more  focused, 
unfaceted  (also  referred  to  as  single-facet)  topics.  Many  of  the  un¬ 
faceted  topic  candidates  were  indeed  unambiguous,  more  tail-like  queries, 
but  a  number  had  potentially  multiple  answers  (e.g.  [dark  chocolate 
health  benefits]).  This  led  to  many  pages  being  partially  relevant,  with 
no  clear  way  for  assessors  to  know  when  it  was  complete  enough.  We 
believe  retaining  a  blend  of  more  or  less  focused  query  types  is  impor¬ 
tant  to  reflect  the  nature  of  authentic  Web  queries,  but  will  look  at 
revised  query  clusters  and  clearer  topic  development  and  assessment 
guidelines  for  unfaceted  topics. 
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Table  4:  Top  diversity  measures  on  results  ordered  by  ERR-IA@10.  Only  the  best 
run  according  to  ERR-IA@10  from  each  group  is  included  in  the  ranking. 


Group 

Run 

Cat 

Type 

ERR-IA@10 

a-nDCG@10 

NRBP 

udeLfang 

UDInfolabWEB2 

A 

auto 

0.574 

0.628 

0.547 

Technion 

clustmrfaf 

A 

auto 

0.554 

0.620 

0.521 

ICTNET 

ICTNET  13RSR3 

A 

auto 

0.542 

0.598 

0.512 

uogTr 

uogTrAIwLmb 

A 

auto 

0.539 

0.606 

0.498 

udel 

udelPseudo2 

A 

auto 

0.531 

0.612 

0.486 

ut 

ut22base 

A 

auto 

0.506 

0.574 

0.470 

wistud 

wistud. runD 

A 

manual 

0.503 

0.558 

0.466 

diro_web_13 

udemQlmllFbWiki 

A 

auto 

0.475 

0.557 

0.433 

CWf 

cwiwtl3cps 

A 

auto 

0.473 

0.531 

0.439 

UJS 

UJS13Risk2 

B 

auto 

0.461 

0.516 

0.434 

webis 

webismixed 

A 

auto 

0.409 

0.468 

0.374 

RMIT 

RMITSC75 

A 

auto 

0.376 

0.448 

0.330 

MSR_Redmond 

msr_alpha0_95_4 

A 

manual 

0.358 

0.444 

0.306 

Organizers 

baseline 

A 

auto 

0.342 

0.416 

0.294 

UWaterlooCLAC 

UWCWEB13RISK02 

A 

auto 

0.315 

0.373 

0.283 

DLDE 

dlde 

B 

manual 

0.045 

0.058 

0.038 

•  Having  a  two-stage  submission  process  to  determine  baselines,  or  user- 
supplied  baselines.  This  would  involve  ad-hoc  runs  being  submitted 
first,  followed  by  selection  of  some  of  those  runs  to  be  redistributed  to 
participants  to  use  in  the  2nd  re-ranking  task. 

•  Using  judgments  more  effectively/judging  more  queries  -  one  idea  that 
might  be  feasible  if  we  do  a  two-stage  process  is  to  specifically  tar¬ 
get  queries  where  the  submitted  ad-hoc  runs  have  high  variability.  It 
would  require  careful  thought  since  it  is  possible  the  risk-sensitive  ap¬ 
proaches  could  still  degrade  performance  here. 

•  More  holistic  types  of  analysis  that  compare  tradeoff  curves  within 
and  across  systems. 

We  will  continue  the  use  of  ClueWebl2  as  the  test  collection  for  TREC  2014. 
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Table  5:  Top  diversity  measures  ordered  by  ERR-IA@20.  Only  the  best  run 
according  to  ERR-IA@20  from  each  group  is  included  in  the  ranking. 


Group 

Run 

Cat 

Type 

ERR-IA@20 

a-nDCG@20 

NRBP 

udehfang 

UDInfolabWEB2 

A 

auto 

0.582 

0.654 

0.547 

Technion 

clustmrfaf 

A 

auto 

0.567 

0.668 

0.521 

ICTNET 

ICTNET  13RSR3 

A 

auto 

0.551 

0.627 

0.512 

uogTr 

uogTrAIwLmb 

A 

auto 

0.548 

0.637 

0.498 

udel 

udelPseudo2 

A 

auto 

0.539 

0.637 

0.486 

ut 

ut22base 

A 

auto 

0.513 

0.596 

0.470 

wistud 

wistud. runD 

A 

manual 

0.512 

0.589 

0.466 

CWI 

cwiwtl3cps 

A 

auto 

0.480 

0.557 

0.439 

diro_web_13 

udemQlmllEbWiki 

A 

auto 

0.480 

0.576 

0.433 

UJS 

UJS13Risk2 

B 

auto 

0.468 

0.539 

0.434 

webis 

webismixed 

A 

auto 

0.423 

0.516 

0.374 

RMIT 

RMITSCTh 

A 

auto 

0.388 

0.489 

0.330 

MSR_Redmond 

msr_alphal 

A 

manual 

0.368 

0.476 

0.308 

Organizers 

baseline 

A 

auto 

0.352 

0.451 

0.294 

UWaterlooCLAC 

UWCWEB13RISK02 

A 

auto 

0.323 

0.399 

0.283 

DLDE 

dlde 

B 

manual 

0.045 

0.058 

0.038 
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(b)  Bottom  25  topics  (by  descending  baseline  ERR@10 

Figure  1:  Boxplots  for  TREC  2013  Web  topics,  showing  variation  in 
ERR@10  effectiveness  across  all  submitted  runs.  Topics  are  sorted  by  de¬ 
creasing  baseline  ERR@10  (pink  bar).  Eaceted  topics  are  prefixed  with  ‘E’, 
single-facet  topics  by  ‘S’,  ambigous  topics  by  ‘A’  . 
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Figure  2:  Chart  showing  the  significant  variation  in  ERR@10  per  topic 
between  top-ranked  Technion  and  Glasgow  runs.  Topics  are  sorted  by  de¬ 
creasing  baseline  ERR@10  (pink  dashed  bar).  Faceted  topics  are  prefixed 
with  ‘F’,  single-facet  topics  by  ‘S’,  ambigous  topics  by  ‘A’. 
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